We introduce a class of learning problems where the agent is presented
with a series of tasks. Intuitively, if there is a relation among those tasks,
then the information gained during execution of one task has value for the
execution of another task. Consequently, the agent is intrinsically motivated
to explore its environment beyond the degree necessary to solve
the current task it has at hand. Thus, in some sense, the model explains
the necessity of curiosity. We develop a decision theoretic setting that
generalises standard reinforcement learning tasks and captures this intuition.
More precisely, we define a sparse reward process as a multi-stage
stochastic game between a learning agent and an opponent. The agent
acts in an unknown environment, according to a utility that is arbitrarily
selected by the opponent. Apart from formally describing the setting, we
link it to bandit problems, bandits with covariates and factored MDPs.
Finally, we examine the behaviour of a number of learning algorithms in
such a setting, both experimentally and theoretically.



1 Introduction 

This paper introduces sparse reward processes. These capture the problem 
of acting in an unknown environment, with an arbitrary unknown sequence of 
future objectives. The question is: how to act so as to perform well in the current 
objective, while at the same time acquiring knowledge that might be useful for 
future objectives? It is thus analogous to a number of real-world problems with
high uncertainty about future tasks, as well as the more philosophical problem 
of motivating the utility of curiosity in human behaviour. 

We formulate this setting in terms of a multi-stage game between a learning
agent and an opponent of unknown type. The agent acts in an unknown
Markovian environment, which is the same in every stage. At the beginning of
each stage, a payoff function is chosen by the opponent, which determines the
agent's utility for that stage. The agent must act not only so as to maximise
expected utility at each stage, but also so that he can be better prepared for
whatever payoff function the opponent will select at the next stage.

We call such problems sparse reward processes, because of two types of
sparseness. The first refers to payoff scarcity: the payoff available at every
stage is bounded, while the agent wants to maximise the total payoff across
stages. The second refers to the fact that the payoff function is sparse for an
adversarial opponent. We posit that this is a good model of life-long learning
in uncertain environments: while resources must be spent learning about
currently important tasks, there is also a need to allocate effort towards learning
about aspects of the world which are not relevant at the moment, because
unpredictable future events may lead to a change of priorities for the
decision maker. Thus, in some sense, the model "explains" the
necessity of curiosity.

While our main contribution is the introduction of the problem, we also
analyse some basic properties. We show that when the opponent is nature, the
problem becomes an unknown MDP. For adversarial opponents, a good strategy
for a two-stage version of the game is to maximise the information gain with
respect to the MDP model, linking our formulation to exploration heuristics such
as compression progress [Schmidhuber, 1991], information gain [Lindley, 1956]
and approximations to the value of information [Koller and Friedman, 2009,
Sec. 23.7]. For the general adversarial case, we show that either sampling from
the posterior [Strens, 2000] or confidence-bound based approaches [Auer et al.,
2002, Jaksch et al., 2010] perform well compared to a greedy policy. However,
when the opponent is nature, a greedy policy performs very well, as the payoff
stochasticity forces the agent to explore.

The next section introduces the setting and formalises the environment, the
payoff, the policy and the complete sparse reward process. Sec. 3 examines the
properties that arise for the two opponent types: nature and adversarial. Sec. 4
briefly explains two algorithms for acting in SRPs, derived from two well-known
reinforcement learning exploration algorithms based on confidence bounds and
Bayesian sampling respectively. The experimental setup is described in Sec. 5,
while Sec. 6 concludes the paper with a discussion of related work and links to
other problems in reinforcement learning and decision theory.



2 Setting 

The setting can be formalised as a multi-stage game between the agent and an
opponent, on a stochastic environment $\nu$. At the beginning of the $k$-th stage the
opponent chooses a payoff $\rho_k$, which he reveals to the agent, who then selects
an arbitrary policy $\pi_k$. It then acts in $\nu$ using $\pi_k$ until the current stage enters a
terminating state. This interaction results in a random sequence of state visits
$s$, whose utility for the agent is $\rho_k(s)$. The agent's goal is to minimise the total
expected regret $\sum_k V_k^* - V_k$, where $V_k \triangleq V(\rho_k, \pi_k)$ is the expected utility and
$V_k^* \triangleq \sup_\pi V(\rho_k, \pi)$ is the maximum expected utility for that stage.

If the dynamics are known to the agent, then selecting $\pi_k$ to maximise the
total expected payoff only requires playing the optimal strategy for each stage
and disregarding the remaining stages. When $\nu$ is unknown, however, learning
about the environment is important for performing well in the later stages. The
setting then becomes an interesting special case of the exploration-exploitation
problem.



2.1 The environment 

At every stage, the agent is acting within an unknown environment. We assume
that the opponent has no control over the environment's dynamics and
that these are constant throughout all stages. More specifically, we define the
environment to be a controlled Markov process:

Definition 1. A controlled Markov process (CMP) $\nu$ is a tuple $\nu = (\mathcal{S}, \mathcal{A}, \mathcal{T})$,
with state space $\mathcal{S}$, action space $\mathcal{A}$, and transition kernel

$$\mathcal{T} = \{\, \tau(\cdot \mid s, a) \mid s \in \mathcal{S}, a \in \mathcal{A} \,\},$$

indexed in $\mathcal{S} \times \mathcal{A}$ such that $\tau(\cdot \mid s, a)$ is a probability measure$^1$ on $\mathcal{S}$. If at time
$t$ the environment is in state $s_t \in \mathcal{S}$ and the agent chooses action $a_t \in \mathcal{A}$, then
the next state $s_{t+1}$ is drawn with a probability independent of previous states
and actions:

$$P_\nu(s_{t+1} \in S \mid s^t, a^t) = \tau(S \mid s_t, a_t), \qquad S \subset \mathcal{S}. \tag{2.1}$$

Finally, we shall use $\mathcal{N}$ for the class of CMPs.

In the above, and throughout the text, we use the following conventions. We
employ $P_\nu$ to denote the probability of events under a process $\nu$, while we use
$s^t = s_1, \ldots, s_t$ and $a^t = a_1, \ldots, a_t$ to represent sequences of variables. Similarly,
$\mathcal{S}^t$ denotes product spaces, and $\mathcal{S}^* = \bigcup_{t=0}^{\infty} \mathcal{S}^t$ denotes the set of all sequences
of states. Arbitrary-length sequences in $\mathcal{S}^*$ will be denoted by $s$.

Throughout this paper, we assume that the transition kernel is not known 
to the agent, who must estimate it through interaction. On the other hand, 
the payoff function, chosen by the opponent, is revealed to the agent at the 
beginning of each stage. 

2.2 The payoff 

At the $k$-th stage, a payoff function $\rho_k : \mathcal{S}^* \to [0, 1]$ is chosen by the opponent.
This encodes how desirable a state sequence is to the agent for the task. In
particular, if $s, s' \in \mathcal{S}^*$ are two state sequences, then $s$ is preferred to $s'$ in
round $k$ if and only if $\rho_k(s) > \rho_k(s')$. As a simple example, consider a $\rho$ such
that sequences $s$ going through a certain state $s^*$ have a payoff of 1, while the
remaining have a payoff of 0.

1 We assume the measurability of all sets with respect to some appropriate $\sigma$-algebra. This
will usually be the Borel algebra $\mathcal{B}(X)$ of the set $X$.

The usual reinforcement learning (RL) setting can be easily mapped to this.
Recall that in RL the agent is acting in a Markov decision process (MDP) $\mu$
[Puterman, 2005]. This is a CMP equipped with a set of distributions
$\{\, P(\cdot \mid s) \mid s \in \mathcal{S} \,\}$ on rewards $r_t \in \mathbb{R}$. In the infinite-horizon,
discounted reward setting, the utility is defined as the discounted sum of rewards
$\sum_t \gamma^t r_t$, where $\gamma \in [0, 1]$ is a discount factor. We can map this to our
framework by setting

$$\rho(s^T) = \sum_{t=1}^{T} \gamma^t \, E(r_t \mid s_t) = \sum_{t=1}^{T} \gamma^t \int_{-\infty}^{\infty} r \, dP(r \mid s_t)$$

to be the payoff for a state sequence $s^T$. While
the theoretical development applies to general payoff functions, the experimental
results and algorithms use the RL setting.
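As a minimal sketch of this mapping (not part of the original development), the payoff of a finite state sequence can be computed from the expected reward of each visited state. The function name `discounted_payoff`, the dictionary `expected_reward` and the three-state example are assumptions made purely for illustration.

```python
def discounted_payoff(states, expected_reward, gamma):
    """Payoff rho(s^T) = sum_{t=1}^T gamma^t * E[r | s_t] of a state sequence."""
    return sum(gamma ** t * expected_reward[s]
               for t, s in enumerate(states, start=1))


# Hypothetical three-state example in which only state 2 carries reward.
expected_reward = {0: 0.0, 1: 0.0, 2: 1.0}
payoff = discounted_payoff([0, 1, 2, 2], expected_reward, gamma=0.5)
```

In this example only the last two visits contribute, so the payoff is $0.5^3 + 0.5^4 = 0.1875$.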

2.3 The policy 

After the payoff $\rho_k$ is revealed to the decision maker, he chooses a policy $\pi_k$,
which he uses to interact with the environment. The CMP $\nu = (\mathcal{S}, \mathcal{A}, \mathcal{T})$ and
the payoff function jointly define an MDP, denoted by $\mu_k = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \rho_k)$. The
agent's policy $\pi_k$ selects actions with distribution $\pi_k(a_t \mid s^t)$, meaning that the
policy is not necessarily stationary. Together with the Markov decision process
$\mu_k$, it defines a distribution on the sequence of states, such that:

$$P^{\pi_k}_\nu(s_{t+1} \in S \mid s^t) = \int_{\mathcal{A}} \tau(S \mid a, s_t) \, d\pi_k(a \mid s^t).$$

This interaction results in a sequence of states $s$, whose utility to the agent is
$U_k \triangleq \rho_k(s)$, $s \in \mathcal{S}^*$. Since the sequence of states is stochastic, we set the value
of each stage to the expected utility:

$$V_k \triangleq V(\rho_k, \pi_k) \triangleq E_{\nu, \pi_k} U_k = \int_{\mathcal{S}^*} \rho_k(s) \, dP^{\pi_k}_\nu(s), \tag{2.2}$$

where $P^\pi_\nu$ is the probability measure on $\mathcal{S}^*$ resulting from using policy $\pi$ on
CMP $\nu$. Finally, let us define the oracle policy for stage $k$:

Definition 2 (Oracle stage policy). Given the process $\nu$ and the payoff $\rho_k$ at
stage $k$, the oracle policy is $\pi^*(\nu, \rho_k) = \arg\max_\pi \int_{\mathcal{S}^*} \rho_k(s) \, dP^\pi_\nu(s)$.

This policy is normally unattainable by the agent, since $\nu$ is unknown. The
agent's goal is to minimise the total expected regret$^2$ relative to the oracle:

$$\mathcal{L}_K \triangleq \sum_{k=1}^{K} V(\rho_k, \pi^*(\nu, \rho_k)) - V_k. \tag{2.3}$$
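The regret in (2.3) is a running sum of per-stage gaps between the oracle value and the value the agent achieved. A minimal sketch, assuming both value sequences are already available as lists (the name `cumulative_regret` and the numbers are illustrative, not taken from the experiments):

```python
from itertools import accumulate


def cumulative_regret(oracle_values, achieved_values):
    """Running total of V(rho_k, pi*) - V_k over stages k = 1, ..., K."""
    gaps = (v_star - v for v_star, v in zip(oracle_values, achieved_values))
    return list(accumulate(gaps))


# Hypothetical three-stage game: the agent matches the oracle only in stage 2.
regret = cumulative_regret([1.0, 0.5, 0.75], [0.5, 0.5, 0.25])
```

The last entry of the returned list is the total regret after all stages; a linearly growing list corresponds to the behaviour reported for the greedy policy against an adversary.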

2.4 Sparse reward processes 

The complete sparse reward process is a special case of a stochastic game. However,
we are particularly interested in processes where only few states have
payoffs. We model this by mapping each payoff function to a finite measure on
$\mathcal{S}^*$.

2 We use a slightly different notion of regret from previous work. Instead of using the total
accumulated reward, as in [Jaksch et al., 2010], we consider the total expected utility across
stages. But, if one were to see the payoff obtained at every stage as the reward, the two
measures of regret would be equivalent.

Definition 3. A sparse reward process is a multi-stage stochastic game with
$K$ stages, where the $k$-th stage is a Markov decision process $\mu_k = (\mathcal{S}, \mathcal{A}, \mathcal{T}, \rho_k)$,
whose payoff function $\rho_k : \mathcal{S}^* \to [0, 1]$ is revealed to the agent immediately
after stage $k - 1$ is complete. The agent chooses policy $\pi_k$, with expected utility
$V_k \triangleq V(\rho_k, \pi_k)$. The Markov decision process terminates at time $t$, and the stage
ends, if $s_t$ is a terminal state, and with fixed termination probability $q$ if $s_t$ is
not a terminal state.

The process is called $\sigma$-sparse for a measure $\sigma$ on $\mathcal{S}^*$ if for every $\rho_k \in \mathcal{R}$, the
payoff measure $\lambda_k$ on $\mathcal{S}^*$, defined as $\lambda_k(S) \triangleq \int_S \rho_k(s) \, d\sigma(s)$, $\forall S \subset \mathcal{S}^*$, satisfies
$\lambda_k(\mathcal{S}^*) \le 1$. The agent's goal is to find a sequence $\pi_k$ maximising $\sum_{k=1}^{K} V_k$.

A fixed termination probability $q$ makes each stage equivalent to an infinite-horizon
discounted reward reinforcement learning problem with discount factor $1 - q$
[see Puterman, 2005]. Bounding the
total payoff forces the rewards available at most state sequences to be small
(though not necessarily zero). Finally, it ensures that the opponent cannot
place arbitrarily large rewards in certain parts of the space, and so cannot make
the regret arbitrarily large. Throughout the paper, we take $\sigma$ to be the uniform
measure. The construction also enables much of the subsequent development,
through the following lemma:

Lemma 1. Given a payoff function $\rho$ for which there exists a payoff measure $\lambda$
satisfying the conditions of Def. 3 for some $\sigma$, the utility of any policy $\pi$ on the
MDP $\mu = (\nu, \rho)$ can be written as:

$$E_{\pi,\nu} U = \int_{\mathcal{S}^*} p_{\pi,\nu}(s) \, d\lambda(s), \tag{2.4}$$

where $p_{\pi,\nu}$ is the probability (density) of $s$ (with respect to $\sigma$) under the policy
$\pi$ and the environment $\nu$. We assume that $p_{\pi,\nu}$ always exists, but is not
necessarily finite.

Proof. Via change of measure: $E U = \int \rho(s) \, dP_{\pi,\nu}(s) = \int \rho(s) \, p_{\pi,\nu}(s) \, d\sigma(s) =
\int p_{\pi,\nu}(s) \, d\lambda(s)$. □
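The change of measure in Lemma 1 can be checked numerically on a finite space, where integrals become sums. The sketch below assumes a toy space of four state sequences with a uniform base measure; all numbers and variable names are illustrative.

```python
# Finite toy space of four state sequences (all numbers are illustrative).
sigma = [0.25, 0.25, 0.25, 0.25]   # uniform base measure sigma
rho = [0.0, 1.0, 0.5, 0.0]         # payoff rho(s) in [0, 1]
prob = [0.1, 0.2, 0.3, 0.4]        # P_{pi,nu}(s): probability of each sequence

# Density of P_{pi,nu} with respect to sigma, and the payoff measure lambda.
density = [p / w for p, w in zip(prob, sigma)]
lam = [r * w for r, w in zip(rho, sigma)]

expected_utility = sum(r * p for r, p in zip(rho, prob))       # int rho dP
via_payoff_measure = sum(d * l for d, l in zip(density, lam))  # int p dlambda
```

Both quantities equal 0.35 here, and the total mass of `lam` is 0.375, so the sparsity condition $\lambda(\mathcal{S}^*) \le 1$ holds for this example.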



3 Properties 

The optimality of an agent policy depends on the assumptions made about the 
opponent. In a worst-case setting, it is natural to view each stage as a zero-sum 
game, where the agent's regret is the opponent's gain. If the opponent is nature, 
then the sparse reward process can be seen as an MDP. This is also true in the 
case where we employ a prior over the opponent's type. 



3.1 When the opponent is nature 

Consider the case when the opponent selects the payoffs $\rho_k$ by drawing them
from some fixed, but unknown distribution with measure $\phi(\cdot \mid \theta)$, parametrised
by $\theta \in \Theta$, such that: $P(\rho_k \in B) = \phi(B \mid \theta)$, $\forall k \in \{1, 2, \ldots, K\}$, $\forall B \subset \mathcal{R}$.
In that case, the Bayes-optimal strategy for the agent is to maintain a belief
on $\Theta \times \mathcal{N}$ and solve the problem with backwards induction [DeGroot, 1970], if
possible. This is because of the following fact:

Theorem 1. When the opponent is Nature, the SRP is an MDP.

Proof. We prove this by construction. For a set of reward functions $\mathcal{R}$, the
state space of the MDP can be factored into the reward function and the state
of the dynamics, so $\mathcal{S} = \mathcal{R} \times \mathcal{S}_0$. If there are $K$ reward functions, we can write
the state space as $\mathcal{S} = \bigcup_{k=1}^{K} \mathcal{S}_k$. Let the action space be $\mathcal{A}$, and let there be a set of
bijections $M_{ij} : \mathcal{S}_i \to \mathcal{S}_j$. In addition, for any $i, j$ and all states $s \in \mathcal{S}_i$, the transition
probabilities obey: $P(s_{t+1} \mid s_t = s, a_t = a) = P(s_{t+1} \mid s_t = M_{ij}(s), a_t = a)$ and
$P(s_{t+1} \in \mathcal{S}_j \mid s_t \in \mathcal{S}_i, a_t = a) = q$ if $j \ne i$ and $1 - q$ otherwise. It is easy to
verify that this agrees with Def. 3. □



Unfortunately, the Bayes-optimal solution is usually intractable [DeGroot,
1970, Duff, 2002, Gittins, 1989].



3.2 When the opponent is adversarial 

We look at the problem from the perspective of Bayesian experimental design.
In particular, the agent has a belief, expressed as a measure $\xi$ over $\mathcal{N}$. Then,
the $\xi$-expected utility of any policy $\pi$ is:

$$E_{\xi,\pi} U = \int_{\mathcal{N}} \int_{\mathcal{S}^*} U(s) \, dP^\pi_\nu(s) \, d\xi(\nu). \tag{3.1}$$

Let $P^*_\nu$ and $P^*_\xi$ be the probability measures on $\mathcal{S}^*$ arising from the optimal policy
given the full CMP $\nu$ and given a particular belief $\xi$ over CMPs respectively,
assuming known payoffs $\rho$. The opponent can take advantage of the uncertainty
and select a payoff function that maximises our loss relative to the optimal
policy:

$$\ell_k(\xi, \nu) = \max_{\lambda_k} \int_{\mathcal{S}^*} (P^*_\nu - P^*_\xi) \, d\lambda_k. \tag{3.2}$$

This implies that the opponent should maximise the payoff for sequences with
the largest probability gap between the $\nu$- and $\xi$-optimal policies. To make
this non-trivial, we have restricted the payoff functions to $\lambda(\mathcal{S}^*) \le 1$. In this
case, maximising $\ell_k$ requires setting $\lambda(B) = 1$ for the set of sequences $B$ with
the largest gap, and $0$ everywhere else. This is the second type of sparseness that
SRPs have.
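On a finite set of state sequences the maximisation in (3.2) has a simple form: all payoff mass goes on the sequence with the largest positive gap. The following sketch illustrates this under the simplifying assumption that only the total-mass constraint $\lambda(\mathcal{S}^*) \le 1$ binds; the function name `adversarial_payoff` and the two distributions are hypothetical.

```python
def adversarial_payoff(p_oracle, p_belief):
    """Return (loss, lam), where lam puts unit mass on the largest-gap sequence.

    p_oracle[i] and p_belief[i] are the probabilities of sequence i under the
    nu-optimal and xi-optimal policies respectively.
    """
    gaps = [a - b for a, b in zip(p_oracle, p_belief)]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    lam = [0.0] * len(gaps)
    if gaps[best] > 0:
        lam[best] = 1.0  # sparse payoff measure: a single sequence is rewarded
    return max(gaps[best], 0.0), lam


p_oracle = [0.7, 0.2, 0.1]   # P*_nu over three hypothetical sequences
p_belief = [0.3, 0.5, 0.2]   # P*_xi over the same sequences
loss, lam = adversarial_payoff(p_oracle, p_belief)
```

Here the adversary rewards only the first sequence, which the oracle visits with probability 0.7 and the agent with probability 0.3, for a stage loss of 0.4.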

We now show that in a special two-stage version of our game, a strategy that
maximises the expected information gain minimises a bound on our expected
loss. First, we recall the definition of expected information gain [Lindley, 1956]:



Definition 4 (Expected information gain). Assume some prior probability measure
$\xi$ on a parameter space $\mathcal{N}$, and a set of experiments $\mathcal{P}$, indexing a set of
measures $\{\, P^\pi_\nu \mid \nu \in \mathcal{N}, \pi \in \mathcal{P} \,\}$ on $\mathcal{X}$. The expected information gain of the
experiment $\pi$ consisting of drawing an observation $x$ from the unknown $P^\pi_\nu$ is:

$$\mathcal{G}(\pi, \xi) = \int_{\mathcal{N}} \int_{\mathcal{X}} \ln \frac{P^\pi_\nu(x)}{P^\pi_\xi(x)} \, dP^\pi_\nu(x) \, d\xi(\nu), \tag{3.3}$$

where $P^\pi_\xi(x)$ is the marginal $\int_{\mathcal{N}} P^\pi_\nu(x) \, d\xi(\nu)$.

In our case, the parameter space $\mathcal{N}$ is the set of environment dynamics while
the observation set $\mathcal{X}$ is the set of state-action sequences $(\mathcal{S} \times \mathcal{A})^*$.

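Theorem 2. Consider a two-stage game where, for the first stage,
$\rho_1 = 0$. Then, maximising the expected information gain is sufficient to
minimise the expected regret.

Proof. Through the definition of the stage $k$ loss (3.2), and as $\lambda(\mathcal{S}^*) \le 1$:

$$\begin{aligned}
\ell_k(\xi, \nu) &= \max_\lambda \int_{\mathcal{S}^*} (P^*_\nu - P^*_\xi) \, d\lambda \le \max_\lambda \int_{\mathcal{S}^*} |P^*_\nu - P^*_\xi| \, d\lambda \\
&\le \lambda(\mathcal{S}^*) \int_{\mathcal{S}^*} |P^*_\nu - P^*_\xi| \, d\sigma \le \|P^*_\nu - P^*_\xi\|_1,
\end{aligned} \tag{3.4}$$

where $\|P\|_1 = \int |P| \, d\sigma$ is the $L_1$-norm with respect to $\sigma$. If our initial belief is $\xi$
and the (random) posterior after the first stage is $\xi'$, the expected loss of policy
$\pi$ is given by: $E_{\xi,\pi} \|P^*_\nu - P^*_{\xi'}\|_1$. Finally, since for any measures $P, Q$ it holds that
$2 \int \ln(P/Q) \, dP \ge \|P - Q\|_1^2$ (Pinsker's inequality), we have: $\mathcal{G}(\pi, \xi) \ge E_{\xi,\pi} \tfrac{1}{2} \|P^*_\nu - P^*_{\xi'}\|_1^2$. Via Jensen's
inequality we obtain that $E_{\xi,\pi} \|P^*_\nu - P^*_{\xi'}\|_1 \le \sqrt{2 \mathcal{G}(\pi, \xi)}$. □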
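The Pinsker and Jensen steps can be illustrated on a finite example. The sketch below computes the expected information gain of Definition 4 for a prior over two candidate models and checks that the expected $L_1$ gap between each model's observation distribution and the marginal is at most $\sqrt{2\mathcal{G}}$. It only illustrates the two inequalities for observation distributions; it is not the full argument of the proof, and the function names and numbers are assumptions.

```python
import math


def expected_information_gain(prior, likelihoods):
    """G = sum_nu xi(nu) * KL(P_nu || P_xi), where P_xi is the marginal."""
    n_obs = len(likelihoods[0])
    marginal = [sum(w * lik[x] for w, lik in zip(prior, likelihoods))
                for x in range(n_obs)]
    gain = 0.0
    for w, lik in zip(prior, likelihoods):
        gain += w * sum(p * math.log(p / m)
                        for p, m in zip(lik, marginal) if p > 0)
    return gain, marginal


def expected_l1_gap(prior, likelihoods, marginal):
    """E_xi || P_nu - P_xi ||_1 over the candidate models."""
    return sum(w * sum(abs(p - m) for p, m in zip(lik, marginal))
               for w, lik in zip(prior, likelihoods))


prior = [0.5, 0.5]
informative = [[0.9, 0.1], [0.1, 0.9]]    # experiment that separates the models
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # experiment that reveals nothing
g_inf, m_inf = expected_information_gain(prior, informative)
g_un, m_un = expected_information_gain(prior, uninformative)
gap_inf = expected_l1_gap(prior, informative, m_inf)
```

The informative experiment has strictly positive gain while the uninformative one has zero gain, which is the sense in which an information-maximising policy prefers experiments that distinguish candidate environments.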

Thus, choosing a policy that maximises the expected information gain minimises
the expected worst-case loss at the next stage. This is in broad agreement
with past ideas of relating curiosity to gaining knowledge about the environment
(e.g. work such as [Schmidhuber, 1991]). Consequently, pure information-gathering
strategies can have good quality guarantees in this two-stage adversarial game.

For more general games, we must employ other strategies, however, as we
need to balance information gathering (exploration) with obtaining rewards in
the current stage (exploitation). Unfortunately, even finding the policy that
maximises (3.3) is as hard as finding the Bayes-optimal policy. For this reason,
in the next section we consider approximate algorithms.



4 Algorithms 

We use two simple algorithms for SRPs, derived from two well-known strategies
for exploration in bandit problems and reinforcement learning in general. The
first, Upper Confidence bound SRP (UCSRP, Alg. 1), chooses policies based
on simple confidence bounds, similarly to UCB [Auer et al., 2002] for bandit
problems and UCRL [Jaksch et al., 2010] for general reinforcement learning.
The second, Bayesian Thompson sampling (BTSRP, Alg. 2), chooses a policy
by drawing samples from a posterior distribution, as in [Strens, 2000]. To simplify
the exposition, we restrict our attention to some arbitrary stage $k$ and consider
a setting where we have a finite set of policies $\mathcal{P}$.

Algorithm 1 UCSRP: Upper Confidence bound SRP

1: for $k = 1, \ldots, K$ do
2:   Find the largest $\{\, \mathcal{N}(\hat{P}_\pi) \mid \pi \in \mathcal{P} \,\}$ s.t. $P(\exists \pi : P_\pi \notin \mathcal{N}(\hat{P}_\pi)) < 1/k$.
3:   Select $\pi_k \in \arg\max_\pi V^+_\pi(\rho_k)$, from (4.2).
4:   Execute $\pi_k$, observe $s, a$; get payoff $\rho(s)$.
5:   Update $\{\, \hat{P}_\pi \mid \pi \in \mathcal{P} \,\}$ from $s, a$.
6: end for

Algorithm 2 BTSRP: Bayesian Thompson sampling SRP

1: Set initial beliefs $\xi_1(P_\pi)$.
2: for $k = 1, \ldots, K$ do
3:   Sample $\hat{\nu} \sim \xi_k$.
4:   Choose $\pi_k = \pi^*(\hat{\nu}, \rho_k)$.
5:   Execute $\pi_k$, observe $s, a$; get payoff $\rho(s)$.
6:   Calculate new posterior $\xi_{k+1}(\cdot) = \xi_k(\cdot \mid s, a)$.
7: end for

UCSRP (Alg. 1) uses confidence regions. An abstract view of the method
is the following. For any policy $\pi$, let the empirical measure on $\mathcal{S}^*$ be $\hat{P}_\pi$, and
let:

$$\mathcal{N}_\epsilon(\hat{P}_\pi) \triangleq \{\, Q \mid \|Q - \hat{P}_\pi\|_1 \le \epsilon \,\} \tag{4.1}$$

be a confidence region around the empirical measure, where $\|P\|_1 = \int |P| \, d\sigma$ is
the $L_1$-norm with respect to $\sigma$. Then we define the optimistic value

$$V^+_\pi \triangleq \max \{\, E(\rho \mid Q) \mid Q \in \mathcal{N}_\epsilon(\hat{P}_\pi) \,\} \tag{4.2}$$

to be the value within the interval maximising the expected payoff. This can be
seen as an optimistic evaluation of the policy $\pi$ that holds with high probability,
and we choose $\pi_k \in \arg\max_\pi V^+_\pi$. For RL problems, there is no need to evaluate
all policies. The algorithm can be implemented efficiently via the augmented
MDP construction of UCRL [Jaksch et al., 2010].
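For a finite outcome space, the maximisation in (4.2) over an $L_1$ ball can be solved greedily: move up to $\epsilon/2$ of probability mass onto the best outcome and remove the same amount from the worst ones. The sketch below follows this standard construction from the optimistic reinforcement learning literature; it is an illustration under the assumption of a finite outcome space, and the name `optimistic_value` is hypothetical.

```python
def optimistic_value(p_hat, payoff, eps):
    """Maximise sum_x q[x] * payoff[x] over distributions q with ||q - p_hat||_1 <= eps."""
    q = list(p_hat)
    order = sorted(range(len(q)), key=lambda x: payoff[x])  # worst outcome first
    best = order[-1]
    # Add as much mass as allowed to the highest-payoff outcome.
    q[best] = min(1.0, p_hat[best] + eps / 2.0)
    # Remove the surplus from the lowest-payoff outcomes until q sums to one.
    excess = sum(q) - 1.0
    for x in order:
        if excess <= 0.0:
            break
        if x == best:
            continue
        removed = min(q[x], excess)
        q[x] -= removed
        excess -= removed
    return sum(qx * px for qx, px in zip(q, payoff)), q


value, q = optimistic_value([0.5, 0.3, 0.2], [0.0, 0.5, 1.0], eps=0.2)
```

In this example the empirical value is 0.35, while the optimistic value is 0.45, obtained by shifting mass 0.1 from the zero-payoff outcome to the unit-payoff outcome.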



BTSRP (Alg. 2) draws a candidate CMP $\nu_k \sim \xi_k$ from the belief at stage
$k$, and then calculates the stationary policy that is optimal for $(\nu_k, \rho_k)$. At the
end of the stage, the belief is updated via Bayes's theorem:

$$\xi_{k+1}(B) = \frac{\int_B P_\nu(s \mid a) \, d\xi_k(\nu)}{\int_{\mathcal{N}} P_\nu(s \mid a) \, d\xi_k(\nu)}. \tag{4.3}$$

This type of Thompson sampling [Thompson, 1933] performs well in multi-armed
bandit problems [?], but its general properties are unknown.
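When the belief is supported on a finite set of candidate CMPs, the update (4.3) reduces to reweighting each candidate by the likelihood of the observed transitions. A minimal sketch under that assumption follows; the two candidate models, the observed transitions and the function name `posterior_update` are all hypothetical.

```python
import random


def posterior_update(belief, models, transitions):
    """Bayes update of a belief over a finite set of candidate transition models.

    models[i][s][a] is the next-state distribution of candidate i, and
    transitions is a list of observed (s, a, s_next) triples.
    """
    weights = []
    for w, model in zip(belief, models):
        likelihood = 1.0
        for s, a, s_next in transitions:
            likelihood *= model[s][a][s_next]
        weights.append(w * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


# Two hypothetical candidates with two states and a single action.
sticky = [[[0.9, 0.1]], [[0.1, 0.9]]]
mixing = [[[0.5, 0.5]], [[0.5, 0.5]]]
observed = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)]
belief = posterior_update([0.5, 0.5], [sticky, mixing], observed)

# Thompson step: draw one candidate index from the posterior.
sampled = random.Random(0).choices([0, 1], weights=belief)[0]
```

The agent would then compute the optimal policy for the sampled candidate and the revealed payoff, as in line 4 of Alg. 2.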

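Lemma 2. Consider a payoff function $\rho$ with corresponding payoff measure $\lambda$.
Assume that $\epsilon$ is such that confidence regions hold, i.e. that $P_\pi \in \mathcal{N}_\epsilon(\hat{P}_\pi)$ for
all $\pi$. For UCSRP to choose a sub-optimal policy $\pi$, it is necessary that:

$$E(\rho \mid P_{\pi^*}) \le E(\rho \mid P_\pi) + 2 \int c_\pi \, d\lambda.$$

Proof. Since UCSRP always chooses $\pi$ maximising $V^+_\pi$, if we choose a
sub-optimal $\pi$ then it must hold that $E(\rho \mid P^+_{\pi^*}) \le E(\rho \mid P^+_\pi)$. Since the confidence
regions hold, $E(\rho \mid P_{\pi^*}) \le E(\rho \mid P^+_{\pi^*})$, $E(\rho \mid P_\pi) \le E(\rho \mid P^+_\pi)$ and $E(\rho \mid \hat{P}_\pi) \le
E(\rho \mid P_\pi + c_\pi)$. Consequently:

$$\begin{aligned}
E(\rho \mid P_{\pi^*}) &\le E(\rho \mid P^+_\pi) = \int (\hat{P}_\pi + c_\pi) \, d\lambda \\
&\le \int (P_\pi + 2 c_\pi) \, d\lambda = E(\rho \mid P_\pi) + 2 \int c_\pi \, d\lambda.
\end{aligned}$$

□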

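Theorem 3. Let $c_{\pi,k}$ be the relevant signed measure for policy $\pi$ in stage $k$.
Assume that $\exists a, b > 0$ s.t. $\|c_{\pi,k}\| \le a \, n_{\pi,k}^{-b}$, with $n_{\pi,k} = \sum_{j=1}^{k} \mathbb{I}\{\pi_j = \pi\}$.

Proof. It will be convenient to use $\rho' \pi \triangleq V(\rho, \pi)$ to denote the value of $\pi$ for
payoff function $\rho$. This has the usual vector meaning. Let $\rho = (\rho_k)$ be a
sequence of payoffs and let $\pi^*_k \in \arg\max_{\pi \in \mathcal{P}} \rho'_k \pi$ be an optimal policy at stage
$k$ in hindsight, and let $\pi_k$ be our actual policy for that stage. Then the regret
after $K$ stages, $\mathcal{L}_K$, is bounded as follows:

$$\begin{aligned}
\mathcal{L}_K &\le \max_\rho \sum_{k=1}^{K} (\rho'_k \pi^*_k - \rho'_k \pi_k) \\
&= \max_\rho \sum_{k=1}^{K} \Big( \rho'_k \pi^*_k - \rho'_k \sum_{\pi} \pi \, \mathbb{I}\{\pi_k = \pi\} \Big) \\
&\le \sum_{\pi \in \mathcal{P}} \sum_{k=1}^{K} \max_{\rho_k} \mathbb{I}\{\pi_k = \pi\} \, \rho'_k (\pi^*_k - \pi) \\
&\le \sum_{\pi \in \mathcal{P}} \sum_{k=1}^{K} \max \mathbb{I}\{\epsilon_{\pi,k} > \Delta_{\pi,k}\} \, \Delta_{\pi,k},
\end{aligned}$$

where $\epsilon_{\pi,k} = 2 \|c_{\pi,k}\| \|\lambda_k\|_1$ and $\Delta_{\pi,k} = \rho'_k (\pi^*_k - \pi)$.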

The actual shape of the confidence region, for UCSRP, and the belief, for
BTSRP, depend on the model we are using. In general, they have the form
$c_i = a \, n_i^{-b}$, where $n_i$ is the number of times the $i$-th policy was chosen and
$a > 0$, and $b > 1$, but can be tighter if there is an interrelationship between
policies.

In both cases, a new stationary policy is selected at the beginning of each 
stage. We consider two types of games, for which we employ slightly different 
versions of the main algorithms. In the first game, each stage terminates after 
the first action is taken. In the second game, each stage terminates only with 
constant probability q at every time-step. 



5 Experiments 

We consider games having a total of $K$ stages. In each stage, the agent observes
a payoff function of the form $\rho_k(s^T) = \sum_{t=1}^{T} r(s_t)$ and then selects a policy $\pi_k$.
The environment is Markov, and the stage terminates with fixed probability $q$,
known to the agent.

Confidence intervals for UCSRP can be constructed via the bound of [Weissman et al.,
2003] on the $L_1$ norm of deviations of empirical estimates of multinomial
distributions. In order to construct an upper confidence bound policy efficiently,
we employ the method of UCRL [Jaksch et al., 2010]. This solves an augmented
MDP where the action space is enlarged to additionally select between
high-probability MDPs. This guarantees that the policy acts according to the most
optimistic MDP in the high-probability region, as required by UCSRP.
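To make the confidence radius concrete, the sketch below inverts the commonly quoted form of the Weissman et al. inequality, $P(\|\hat{p} - p\|_1 \ge \epsilon) \le (2^m - 2) \exp(-n \epsilon^2 / 2)$ for $m$ outcomes and $n$ samples. The exact constants used in the experiments are not stated in the text, so this form, the function name `l1_confidence_radius` and the example parameters are assumptions.

```python
import math


def l1_confidence_radius(n_samples, n_outcomes, delta):
    """Radius eps such that P(||p_hat - p||_1 >= eps) <= delta.

    Obtained by solving (2^m - 2) * exp(-n * eps^2 / 2) = delta for eps,
    which requires at least two outcomes.
    """
    if n_outcomes < 2 or n_samples < 1:
        raise ValueError("need at least two outcomes and one sample")
    log_term = math.log((2.0 ** n_outcomes - 2.0) / delta)
    return math.sqrt(2.0 * log_term / n_samples)


radius_10 = l1_confidence_radius(10, 4, 0.05)
radius_1000 = l1_confidence_radius(1000, 4, 0.05)
```

The radius shrinks at rate $n^{-1/2}$, so a hundredfold increase in samples reduces it by a factor of ten.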

For the BTSRP policy, we maintain a product-Dirichlet distribution (see for
example [DeGroot, 1970]) on the set of multinomial distributions for all
state-action pairs and a product of normal-gamma distributions for the rewards. We
then draw sample MDPs by drawing parameters from each individual part of
the product prior.
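One stage of this sampling scheme can be sketched as follows, restricted to the transition part of the posterior. The sketch assumes known state rewards (the payoff is revealed at the start of each stage), a symmetric Dirichlet prior, and a discount factor of $1 - q$ to account for random termination; the function names, the prior value and the toy counts are all illustrative rather than the settings used in the experiments.

```python
import random


def sample_transition_model(counts, rng, prior=1.0):
    """Draw one transition model from the product-Dirichlet posterior.

    counts[s][a][s_next] holds observed transition counts. Each Dirichlet
    draw is obtained by normalising independent gamma variates.
    """
    model = []
    for s_counts in counts:
        row = []
        for sa_counts in s_counts:
            draws = [rng.gammavariate(prior + c, 1.0) for c in sa_counts]
            total = sum(draws)
            row.append([d / total for d in draws])
        model.append(row)
    return model


def greedy_policy(model, reward, q, sweeps=200):
    """Value iteration with discount 1 - q; returns a stationary policy."""
    gamma = 1.0 - q
    value = [0.0] * len(model)

    def action_value(s, a):
        return reward[s] + gamma * sum(p * v for p, v in zip(model[s][a], value))

    for _ in range(sweeps):
        value = [max(action_value(s, a) for a in range(len(model[s])))
                 for s in range(len(model))]
    return [max(range(len(model[s])), key=lambda a: action_value(s, a))
            for s in range(len(model))]


# Toy two-state, two-action problem with sharply peaked counts.
rng = random.Random(0)
counts = [[[1000, 0], [0, 1000]], [[0, 1000], [1000, 0]]]
model = sample_transition_model(counts, rng)
policy = greedy_policy(model, reward=[0.0, 1.0], q=0.1)
```

With these counts the sampled model is nearly deterministic, so the resulting policy moves from state 0 to the rewarding state 1 and then stays there.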

5.1 Opponents 

For reasons of tractability, and better correspondence with the reinforcement
learning setting, the opponents we consider use only additive payoff functions,
such that the same reward $r(s)$ is always obtained when visiting state
$s$ and the payoff of a sequence of states $s_1, \ldots, s_t$ is simply $\rho(s_1, \ldots, s_t) =
\sum_{i=1}^{t} r(s_i)$. We consider two types of opponents: nature, and a myopic
adversary.

Nature. In this case, the reward functions are sampled uniformly such that 

Adversarial. This opponent has knowledge of $\nu$, and also maintains the
empirical estimate $\hat{\nu}$. Assuming that the agent's estimates must be close to the
empirical estimate, the payoff is selected to maximise the stage loss (3.2). This
is a sparse payoff, as explained in Sec. 3.2, based on the empirical estimate
rather than the (unknown) agent's belief:

$$\rho_k \in \arg\max_\rho V(\rho, \pi^*(\nu, \rho)) - V(\rho, \pi^*(\hat{\nu}, \rho)). \tag{5.1}$$
p 

5.2 Results 

The results are summarised in Fig. 1. These also include a greedy agent, i.e.
the stationary policy which maximises payoff for the current stage in empirical
expectation. In both cases, the regret suffered by the greedy agent grows
linearly, while that of BTSRP and UCSRP grows slowly for adversarial opponents.
However, when the opponent is nature this is no longer the case. This is
due to the fact that the distribution of payoffs provides a natural impetus for
exploration even for the greedy agent.




[Figure 1 panels: (c) Nature, $|\mathcal{S}| = 4$, $|\mathcal{A}| = 2$, $q = 0.5$; (d) Nature, $|\mathcal{S}| = 8$, $|\mathcal{A}| = 4$, $q = 0.1$.]

Figure 1: Comparison of the expected cumulative regret after $k$ stages, averaged
over $10^3$ runs, between BTSRP and stagewise stationary greedy policies, on
randomly generated MDPs. Against nature, none of the policies suffer a large
amount of regret. Against an adversarial opponent, the greedy policy suffers
linear regret. When the opponent is nature, the exploring policies enjoy no
advantage.






6 Discussion 



We introduced sparse reward processes, which capture the problem of acting in
an unknown environment with arbitrarily selected future objectives. We have
shown that, in a two-stage adversarial problem, a good strategy is to maximise
the expected information gain. This links with previous work on curiosity and
statistical decision theory. In fact, the connection of information gain to multiple
tasks had arguably been recognised by Lindley [1956]:

. . . although indisputably one purpose of experimentation is to reach
decisions, another purpose is to gain knowledge about the state of
nature (that is, about the parameters) without having specific
actions in mind.

We have evaluated three algorithms on various problem instances. Overall, when 
the opponent is nature, even the greedy strategy performs relatively well. This 
is because it is forced to explore the environment by the sequence of payoffs. 
However, an adversarial opponent necessitates the use of the more sophisticated 
algorithms, which tend to explore the environment. This is partially explained 
by results in the related setting of multi-armed bandit problems with covariates
[Pavlidis et al., 2008, Rigollet and Zeevi, 2010, Yang and Zhu, 2002, ?].
There, again the payoff function is given at the beginning of every stage. In that
setting, however, the opponent is nature and, more importantly, the only thing
observed after an action is chosen is a noisy reward signal. So, in some sense, it is
a harder problem than the one considered herein (and indeed Rigollet and Zeevi
[2010] prove a lower bound). The one-armed covariate bandit is analysed by ? for an
exponential family model, where it is proved that a myopic policy is asymptotically
optimal in a discounted setting. This ties in very well with our results on problems where
the opponent is nature.

Finally, SRPs are related to other multi-task learning settings. For example,
Lugosi et al. [2008] consider the problem of online multi-task learning with
hard constraints. That is, at every round, the agent takes an action in each
and every task, but there are some constraints which reflect the tasks' similarity.
Somewhat closer to SRPs is the game-theoretic setting of ?, where
again the agent is solving a multi-objective problem where the goal is that
a reward vector approaches a target set. Finally, there is a close relation to
the problem of learning with multiple bandits [Dimitrakakis and Lagoudakis,
2008, Gabillon et al., 2011]. Essentially, this problem involves finding
near-optimal policies for a number of possibly related sub-problems within a search
budget. In Gabillon et al. [2011] the tasks are unrelated bandit problems, while
in Dimitrakakis and Lagoudakis [2008] the tasks are actually different states of
a Markov decision process and the goal is to find the best initial actions given
a rollout policy.

Finally, our experimental results show that a greedy policy is a good strategy
when the payoff sequence is (uniformly) stochastic. This naturally encourages
exploration, even for non-curious agents, by forcing them to visit all states
frequently. UCSRP and BTSRP, which explore naturally, perform much better
for adversarial payoffs, where the greedy player suffers linear total regret.
Consequently, we may conclude that curiosity is not in fact necessary when the
constant change of goals forces exploration upon the agent.